Accelerating the inference of a trained DNN is a well-studied subject. In this paper we switch the focus to the training of DNNs. The training phase is compute intensive, demands complicated data communication, and contains multiple levels of data dependencies and parallelism. This paper presents an algorithm/architecture space exploration of efficient accelerators to achieve better network convergence rates and higher energy efficiency for training DNNs. We further demonstrate that an architecture with hierarchical support for collective communication semantics provides flexibility in training various networks performing both stochastic and batched gradient descent based techniques. Our results suggest that smaller networks favor non-batched techniques while performance for larger networks is higher using batched operations. At 45nm technology, CATERPILLAR achieves performance efficiencies of 177 GFLOPS/W at over 80% utilization for SGD training on small networks and 211 GFLOPS/W at over 90% utilization for pipelined SGD/CP training on larger networks using a total area of 103.2 mm$^2$ and 178.9 mm$^2$ respectively.
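To make the distinction between the non-batched and batched regimes concrete, the minimal sketch below contrasts a per-sample (stochastic) weight update with a minibatch update for a toy linear layer. It is illustrative only: the layer, loss, data, and hyperparameters are hypothetical stand-ins and do not correspond to CATERPILLAR's networks or training setup.

```python
# Minimal sketch: per-sample SGD vs. minibatch gradient descent on a toy
# linear layer with squared-error loss. All sizes and values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))   # weights of a toy linear layer
X = rng.normal(size=(32, 8))             # 32 training samples, 8 features each
Y = rng.normal(size=(32, 4))             # regression targets
lr = 0.01

def grad(W, x, y):
    """Gradient of 0.5 * ||W x - y||^2 with respect to W for one sample."""
    err = W @ x - y
    return np.outer(err, x)

# Non-batched SGD: one weight update per sample. Updates are frequent but
# each one is a small matrix-vector computation -- the regime the paper
# finds favorable for smaller networks.
for x, y in zip(X, Y):
    W -= lr * grad(W, x, y)

# Batched gradient descent: one update per minibatch. Updates are less
# frequent but use larger matrix-matrix computations -- the regime the
# paper finds favorable for larger networks.
batch = 8
for i in range(0, len(X), batch):
    xb, yb = X[i:i + batch], Y[i:i + batch]
    err = xb @ W.T - yb                  # shape (batch, 4)
    W -= lr * (err.T @ xb) / batch       # gradient averaged over the batch
```

The per-sample loop touches the weights once per example, while the batched loop amortizes each update over a block of examples; the tradeoff between update frequency and arithmetic intensity is what the abstract's small-vs-large network observation refers to.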